Enable inference regret #2782
This pull request was exported from Phabricator. Differential Revision: D61930178
Codecov Report

Attention: Patch coverage is

@@            Coverage Diff             @@
##             main    #2782      +/-   ##
==========================================
- Coverage   95.68%   95.68%    -0.01%
==========================================
  Files         488      488
  Lines       47843    47943      +100
==========================================
+ Hits        45779    45874       +95
- Misses       2064     2069        +5
fc86d34 to 5b06db2
5b06db2 to e0a1512
e0a1512 to d62296c
This pull request has been merged in 75b4bf8.
Summary:
Pull Request resolved: facebook#2782

# Context

Currently, the benchmarks compute an "oracle" value for each point seen, which evaluates the point noiselessly and at the target task and fidelity, or in a way specified by the `BenchmarkRunner`. This produces an `optimization_trace` used for measuring performance. (For MOO, the hypervolume of all points tested is computed.)

While this trace does a good job of capturing whether a good point has been tested, it does not capture *inference regret*: the difference between the value of the point the model would recommend and that of the best point. This distinction becomes important (both for getting a good measure of absolute performance and for comparing methods) in contexts such as

* Bandit problems (in a noisy and discrete space), where the best point will be seen quickly; the question is when the model identifies it
* Multi-fidelity problems, where simply evaluating as many small arms as possible maximizes the current metric for optimization value
* Noisy problems, if different best-point selection strategies are being considered.
# Open questions

* Should inference value always be computed? My take: Yes; it needn't add much computational overhead, as long as evaluating the same parameterization a second time isn't expensive, because we can use a best-point selection strategy of "empirical best." Current implementation: always computes this.
* Should the "oracle trace" (the status quo behavior) always be computed? My take: Yes, because people say they find it helpful, and for consistency with the past. Current implementation: always computes this.
* If we want both, should we tag one of the two traces as "the" trace, for backwards compatibility? The current implementation does this; `BenchmarkResult.optimization_trace` is one of the `inference_value_trace` and the `oracle_trace`, with the `BenchmarkProblem` specifying which one.
* Set of best points returned for MOO: Is choosing K points and then evaluating them by hypervolume what we want?
* To what degree do we want to rely on Ax's `BestPointMixin` functionality, which is pretty stale, is missing functionality we want, requires constructing dummy `Experiment`s, and won't do the right thing for multi-fidelity and multi-task methods? An alternative approach would be to support this for MBM only, which would address or enable addressing all these issues.
* When should the trace be updated in async settings?
* This diff adds support for SOO and MOO and for `n_best_points`, but only supports SOO with 1 best point. That's a lot of infra for raising `NotImplementedError`s. Is this what we want?
* In-sample vs. out-of-sample: Currently, I'm not using these terms at all, since they are confusing in multi-task and multi-fidelity contexts. Is that what we want?
* When people develop best-point functionality in the future, would they do it by updating or adding options to `BestPointMixin._get_trace`? I wrote this under the assumption that they would either do that or use a similar method that consumes an `experiment` and an `optimization_config` and can access the `generation_strategy` used.

# This diff
## High-level changes

Technically, this adds "inference value" rather than "inference regret", because it is not relative to the optimum. That gives it the same sign as the default `optimization_trace`. It is always computed and returned on the `BenchmarkResult`. The old trace is renamed the `oracle_trace`. `optimization_trace` continues to exist; it can be either the `oracle_trace` (default) or the `inference_trace`, depending on what the `BenchmarkProblem` specifies. The `BenchmarkMethod` is responsible for specifying a best-point selector. This currently relies heavily on Ax's best-point functionality, but this can be overridden.

There are major limitations:
* *The ideal approach for MOO isn't supported yet, so MOO isn't supported at all with inference value*: The `BenchmarkProblem` specifies `n_best_points`, how many points are returned as the best, and for MOO we would want `n_best_points > 1` and to take the hypervolume of the oracle values at those points. That is the only way it makes sense to set this up if we want to compare best-point selectors. If we use hypervolume and don't cap `n_best_points`, the ideal best-point selector would give every point. Metrics other than hypervolume, such as the fraction of "best" points actually on the Pareto frontier, would also be odd. However, there is no Ax functionality generically hooked up for getting `k` points to maximize expected hypervolume.
* Different best-point selectors can be compared by using a different `BenchmarkMethod`, either by passing different `best_point_kwargs` to the `BenchmarkMethod` or by subclassing `BenchmarkMethod` and overriding `get_best_parameters` (see the sketch below).
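To make that comparison concrete, here is a minimal, hedged sketch of a comparison harness. Only `best_point_kwargs` and `"use_model_predictions"` come from this diff; `problem`, `make_method`, `run_benchmark`, the `seeds` argument, and the `inference_value_trace` attribute access are caller-supplied stand-ins, not Ax APIs.

```python
from statistics import mean


def compare_selectors(problem, make_method, run_benchmark, seeds=range(5)):
    """Compare two best-point selectors that share the same candidate generation.

    All callables are hypothetical stand-ins: `make_method` builds a method that
    differs only in its best-point selector, and `run_benchmark` runs one
    replication and returns an object carrying an `inference_value_trace`.
    """
    results = {}
    for use_model in (False, True):
        method = make_method(best_point_kwargs={"use_model_predictions": use_model})
        traces = [
            run_benchmark(problem, method, seed=s).inference_value_trace
            for s in seeds
        ]
        # Score each selector by its mean final inference value across replications.
        results[use_model] = mean(trace[-1] for trace in traces)
    return results
```

Since the candidate-generation strategy is held fixed, any gap between the two entries of `results` is attributable to the best-point selector alone.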
## Detailed changes

### BenchmarkResult

Docstrings ought to be self-explanatory.

* The old `optimization_trace` becomes `oracle_trace`
* It always has an `inference_value_trace` as well as an `oracle_trace`
* The `optimization_trace` can be either, depending on what the `BenchmarkProblem` specifies.
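As an illustration of how the three traces relate, here is a minimal sketch of a result container. It is not the actual `BenchmarkResult` dataclass; in particular, storing the problem-level flag on the result is an illustration choice, and the field types are assumptions.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class BenchmarkResultSketch:
    """Illustrative stand-in for the result described above, not the Ax class."""

    name: str
    seed: int
    # Oracle value of the best point evaluated so far (the pre-existing trace).
    oracle_trace: np.ndarray
    # Oracle value of the point the method would recommend after each step.
    inference_value_trace: np.ndarray
    # Whether the problem asked for inference value to be reported as "the" trace.
    report_inference_value_as_trace: bool = False

    @property
    def optimization_trace(self) -> np.ndarray:
        # Backwards-compatible headline trace: one of the two series above,
        # selected by what the problem specifies.
        if self.report_inference_value_as_trace:
            return self.inference_value_trace
        return self.oracle_trace
```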
### `benchmark_replication`

* Computes inference value each time the scheduler generates a trial. Note that incomplete trials can thus be used, since this can happen before the trial completes.
* For MOO, should find K Pareto-optimal parameters (according to the model), get their oracle values, and get the hypervolume of those oracle values in the following manner: construct a new experiment with one BatchTrial whose arms are the K Pareto-optimal parameters and whose metrics are oracle values, and use Ax's best-point functionality to get the hypervolume. This is done to avoid re-implementing inference of objective thresholds, use of constraints, weighting, etc. HOWEVER, MOO is currently unsupported because we don't have a way of getting the K best.
* For SOO, finds the K best parameters (according to the model) and gets their oracle value (sketched below). HOWEVER, K > 1 is currently unsupported.
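A rough sketch of that SOO flow (one best point), with hypothetical stand-ins for the scheduler, method, and problem interfaces; this is not the real `benchmark_replication` code.

```python
import numpy as np


def replication_inference_trace_sketch(problem, method, scheduler) -> np.ndarray:
    """Sketch of the SOO inference-value loop with a single recommended point.

    `problem`, `method`, and `scheduler` are hypothetical objects exposing the
    behavior described above (`evaluate_oracle`, `get_best_parameters`, and a
    trial-generating scheduler); they are not actual Ax classes.
    """
    inference_trace = []
    while not scheduler.done():
        # Each iteration, the scheduler generates (and starts running) a trial.
        scheduler.generate_and_run_trial()
        # Ask the method which single point it would recommend right now; trials
        # still in flight mean this recommendation can rely on incomplete data.
        (best_params,) = method.get_best_parameters(
            experiment=scheduler.experiment, n_points=1
        )
        # Score the recommendation with the oracle: noiseless, at the target
        # task and fidelity (or however the runner defines "oracle").
        inference_trace.append(problem.evaluate_oracle(best_params))
    return np.array(inference_trace)
```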
### `BenchmarkProblem`

* Gets an attribute `report_inference_value_as_trace` that makes the `BenchmarkResult`'s `optimization_trace` be inference value when the problem specifies that inference value should be used. Docstrings should be self-explanatory.
* Adds this to `BenchmarkProblem`. Docstrings should be self-explanatory.
### `BenchmarkMethod`

* Adds a method `get_best_parameters` and an attribute `best_point_kwargs`. If not overridden, `get_best_parameters` uses `BestPointMixin._get_trace` and passes it the `best_point_kwargs`.
* Currently, the only supported argument in `best_point_kwargs` is "use_model_predictions".
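A hedged sketch of that wiring. The class below is a stand-in, not the real `BenchmarkMethod`; the generic `best_point_helper` argument is an illustration device standing in for `BestPointMixin._get_trace`, whose exact signature is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class BenchmarkMethodSketch:
    """Illustrative stand-in for the method described above."""

    name: str
    best_point_kwargs: dict[str, Any] = field(default_factory=dict)

    def get_best_parameters(
        self,
        experiment: Any,
        optimization_config: Any,
        n_points: int,
        best_point_helper: Callable[..., list[dict[str, Any]]],
    ) -> list[dict[str, Any]]:
        # Mirrors the current limitation: only SOO with a single recommended point.
        if n_points != 1:
            raise NotImplementedError("Only n_points=1 is supported for now.")
        # Default wiring: defer to a generic best-point helper and forward
        # best_point_kwargs; per this diff, "use_model_predictions" is the only
        # supported key.
        return best_point_helper(
            experiment=experiment,
            optimization_config=optimization_config,
            **self.best_point_kwargs,
        )
```

Subclasses can override `get_best_parameters` entirely to benchmark a custom selector, which is the second comparison route mentioned above.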
is "use_model_predictions".Reviewed By: Balandat
Differential Revision: D61930178